# Importing the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
# Loading the breast cancer dataset from sklearn (a Bunch object, not a DataFrame)
data = load_breast_cancer()
df1 = pd.DataFrame(data.data, columns=data.feature_names)
df2 = pd.DataFrame(data.target, columns=["Result"])
df3 = pd.concat([df1, df2], axis=1)
df3.head()
Checking the count of malignant (0) and benign (1) cases:
df3["Result"].value_counts()
# checking datatype of columns before plotting
df3.info()
# The Result column is int64 while the features are float64; casting the whole
# frame to float gives the dataframe one common dtype
df3 = df3.astype(float)
df3.info()
# checking the dataset
df3.describe()
# Checking if nan values are present
df3.isna().sum()
There are no missing values, so no missing-value imputation is needed.
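Had missing values been present, a simple column-wise median imputation could look like the sketch below. The small `demo` frame and its injected NaN are purely illustrative, not part of the actual dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one NaN injected for illustration
demo = pd.DataFrame({"mean radius": [14.0, np.nan, 20.0],
                     "mean texture": [19.0, 21.0, 18.0]})

# Fill each column's NaNs with that column's median (NaNs are skipped
# by default when pandas computes the median)
imputed = demo.fillna(demo.median())
```

Median imputation is usually preferred over mean imputation here because it is robust to the outliers this dataset contains.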
# Pairplot: bivariate scatter plots between every pair of features, colored by Result
sns.pairplot(df3, hue="Result")
With all 30 features, the pairplot is hard to read and slow to render, so instead we pick a few feature pairs that appear to be related and plot them individually.
fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, figsize=(20, 10))
ax1.set_title('Mean Radius Vs Mean Perimeter')
sns.scatterplot(x='mean radius', y='mean perimeter', data=df3, ax=ax1, hue='Result')
ax2.set_title('Mean Concavity Vs Mean Concave Points')
sns.scatterplot(x='mean concavity', y='mean concave points', data=df3, ax=ax2, hue='Result')
ax3.set_title('Worst Concavity Vs Worst Concave Points')
sns.scatterplot(x='worst concavity', y='worst concave points', data=df3, ax=ax3, hue='Result')
ax4.set_title('Mean Perimeter Vs Worst Perimeter')
sns.scatterplot(x='mean perimeter', y='worst perimeter', data=df3, ax=ax4, hue='Result')
In all four plots the two features are strongly, positively correlated: for example, as mean radius rises, mean perimeter rises roughly in proportion. These feature pairs therefore behave similarly and carry largely redundant information.
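The visual impression of positive correlation can be checked numerically. The sketch below recomputes the Pearson correlation for each plotted pair directly from the sklearn dataset:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Rebuild the feature frame from sklearn
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# The four feature pairs shown in the scatter plots above
pairs = [("mean radius", "mean perimeter"),
         ("mean concavity", "mean concave points"),
         ("worst concavity", "worst concave points"),
         ("mean perimeter", "worst perimeter")]

# Pearson correlation coefficient for each pair
corrs = {pair: df[pair[0]].corr(df[pair[1]]) for pair in pairs}
```

All four coefficients come out strongly positive, with mean radius vs mean perimeter nearly perfectly correlated, which matches the near-linear scatter in the first plot.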
fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, figsize=(20, 6))
ax1.set_title('Worst Perimeter Vs Mean Texture')
sns.scatterplot(x='worst perimeter', y='mean texture', data=df3, ax=ax1, hue='Result')
ax2.set_title('Worst Perimeter Vs Worst Fractal Dimension')
sns.scatterplot(x='worst perimeter', y='worst fractal dimension', data=df3, ax=ax2, hue='Result')
ax3.set_title('Worst Perimeter Vs Concavity Error')
sns.scatterplot(x='worst perimeter', y='concavity error', data=df3, ax=ax3, hue='Result')
ax4.set_title('Worst Radius Vs Concave Points Error')
sns.scatterplot(x='worst radius', y='concave points error', data=df3, ax=ax4, hue='Result')
In this second set of scatter plots, the benign and malignant cases separate fairly well. For instance, in the first plot, almost all cases with worst perimeter beyond about 120 are malignant (Result = 0); a similar rough boundary can be drawn in plots 2, 3, and 4 to separate benign from malignant.
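As a quick sanity check on that eyeballed boundary, the sketch below scores a hypothetical one-feature rule (the threshold 120 is read off the plot, not tuned): predict malignant (0) whenever worst perimeter exceeds 120, else benign (1):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["Result"] = data.target  # 0 = malignant, 1 = benign

# Hypothetical rule suggested by the scatter plot:
# worst perimeter > 120 -> malignant (0), otherwise benign (1)
pred = (df["worst perimeter"] <= 120).astype(int)
accuracy = (pred == df["Result"]).mean()
```

Even this single-feature threshold classifies the large majority of cases correctly, which supports the claim that a simple boundary separates the two classes reasonably well.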
# To plot a set of boxplot on the entire dataframe
sns.set(rc={'figure.figsize':(16,8)}, font_scale=0.9, style='whitegrid')
df3.boxplot(widths = 0.9)
Although the combined boxplot gives an overall picture, it is not very readable because the features are on very different scales. We therefore plot the features one by one below.
new_df = df3.copy()  # work on a copy so the original dataframe stays intact
df_0 = new_df[new_df['Result'] == 0]  # malignant cases
df_1 = new_df[new_df['Result'] == 1]  # benign cases

fig = plt.figure(figsize=(20, 20))
# One boxplot subplot per feature, split by class
for i, col in enumerate(new_df.columns[0:30], start=1):
    ax = fig.add_subplot(6, 5, i)
    ax.boxplot([df_0[col], df_1[col]])
    ax.set_title(col)
plt.tight_layout()
plt.show()
This grid gives a distinct boxplot for every feature, split by class. Outliers exist in all of the features, so in the next cells we will attempt to remove them using the IQR method.
Using the IQR (Inter-Quartile Range) method for outlier removal
def IQR_OutlierRemoval(df):
    """Drop every row with at least one value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    return df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
# Removing outliers using the function we created before
new_df = IQR_OutlierRemoval(new_df)
print("Shape of the dataframe before outlier removal: ", df3.shape)
print("Shape of the dataframe after outlier removal: ", new_df.shape)
# Plotting a set of boxplot over the entire dataset again post outlier removal
sns.set(rc={'figure.figsize':(16,8)}, font_scale=0.9, style='whitegrid')
new_df.boxplot(widths = 0.9)
We removed outliers using the Inter-Quartile Range method and plotted boxplots before and after the removal. Dropping the outlier rows shrinks the dataframe from 569 rows to 398. We dropped rows because the question asked for it, but imputing the outlier values with the median of their respective feature columns would have preserved all 569 rows.
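That median-imputation alternative could look like the sketch below (a sketch under the same 1.5*IQR fences as `IQR_OutlierRemoval`, recomputed here so the cell is self-contained):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Same IQR fences as the row-dropping approach
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Instead of dropping rows, replace each outlying value with its
# column's median -- every row survives
outlier = (df < lower) | (df > upper)
capped = df.copy()
for col in df.columns:
    capped.loc[outlier[col], col] = df[col].median()
```

Because the column medians always sit inside the IQR fences, the result contains no values outside the fences while keeping the full 569 rows.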
# Using MinMaxScaler to normalize every column to the [0, 1] range
scaler = MinMaxScaler()
scaled_values = scaler.fit_transform(new_df)
new_df.loc[:, :] = scaled_values  # Result is already 0/1, so it is unchanged
sns.set(rc={'figure.figsize':(16,8)}, font_scale=0.9, style='whitegrid')
new_df.boxplot(widths = 0.9)
Because the features have very different ranges, we applied MinMaxScaler to rescale each of them into [0, 1]. With all features on a common scale, the boxplots become directly comparable.
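Under the hood, MinMaxScaler applies the column-wise formula x_scaled = (x - x_min) / (x_max - x_min). The sketch below verifies this on a small made-up array (the values are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Small hypothetical array: two columns with very different ranges
X = np.array([[1.0, 10.0],
              [3.0, 20.0],
              [5.0, 40.0]])

# What the scaler produces...
scaled = MinMaxScaler().fit_transform(X)

# ...versus the formula applied by hand, column by column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```

After scaling, each column's minimum maps to 0 and its maximum to 1, which is exactly why all the boxplots above end up sharing the same vertical range.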